Improvement of predictive accuracies of functional outcomes after subacute stroke inpatient rehabilitation by machine learning models

Objectives
Stepwise linear regression (SLR) is the most common approach to predicting activities of daily living at discharge with the Functional Independence Measure (FIM) in stroke patients, but noisy nonlinear clinical data decrease the predictive accuracies of SLR. Machine learning is gaining attention in the medical field for such nonlinear data. Previous studies reported that machine learning models such as regression tree (RT), ensemble learning (EL), artificial neural networks (ANNs), support vector regression (SVR), and Gaussian process regression (GPR) are robust to such data and increase predictive accuracies. This study aimed to compare the predictive accuracies of SLR and these machine learning models for FIM scores in stroke patients.

Methods
Subacute stroke patients (N = 1,046) who underwent inpatient rehabilitation participated in this study. Only patients' background characteristics and FIM scores at admission were used to build each predictive model of SLR, RT, EL, ANN, SVR, and GPR with 10-fold cross-validation. The coefficient of determination (R²) and root mean square error (RMSE) were calculated between the actual and predicted discharge FIM scores and FIM gain, and were compared among the models.

Results
Machine learning models (R² of RT = 0.75, EL = 0.78, ANN = 0.81, SVR = 0.80, GPR = 0.81) outperformed SLR (0.70) in predicting discharge FIM motor scores. The predictive accuracies of machine learning methods for FIM total gain (R² of RT = 0.48, EL = 0.51, ANN = 0.50, SVR = 0.51, GPR = 0.54) were also better than that of SLR (0.22).

Conclusions
This study suggested that the machine learning models outperformed SLR in predicting FIM prognosis. The machine learning models used only patients' background characteristics and FIM scores at admission and predicted FIM gain more accurately than models in previous studies. ANN, SVR, and GPR outperformed RT and EL. GPR could have the best predictive accuracy for FIM prognosis.

Introduction
of the FIM [32]. Therefore, machine learning algorithms have not been adequately considered in FIM prognosis research.
Gaussian process regression (GPR) can predict an output variable based on the similarities between input variables, and it is robust to noisy data [33]. SLR assumes linear or exponential models, but clinical data do not necessarily satisfy the assumption [34]. SLR and other regression algorithms build predictive models to decrease the difference between original and predicted data and predict the best value; in contrast, GPR can also predict the probabilistic functional outcome with the predicted distribution. The predicted distribution could provide a comprehensive summary that is suitable for predicting prognosis in clinical fields [34]. Therefore, GPR has been used in clinical fields in recent studies [35]. For example, GPR can accurately predict the Functional Ability Scales in head trauma patients with wearable sensors [36] and functional outcomes after stroke with magnetic resonance images [37,38].
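To make the idea of a predicted distribution concrete, the following is a minimal sketch of Gaussian process regression in plain Python. It is not the model built in this study (which used MATLAB); the squared-exponential kernel, its length scale and variance, the noise variance, the constant prior mean, and the toy admission and discharge values are all assumptions chosen only for illustration.

```python
import math

def rbf(a, b, length=10.0, var=100.0):
    """Squared-exponential kernel: similarity decays with |a - b| (assumed kernel)."""
    return var * math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Return lower-triangular L with L L^T = A (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gpr_predict(x_train, y_train, x_new, noise_var=25.0):
    """Return (mean, standard deviation) of the predictive distribution at x_new."""
    mu = sum(y_train) / len(y_train)  # constant prior mean (assumption)
    n = len(x_train)
    # Covariance between training inputs, plus observation noise on the diagonal.
    K = [[rbf(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, [y - mu for y in y_train]))
    k_star = [rbf(x, x_new) for x in x_train]
    mean = mu + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(x_new, x_new) + noise_var - sum(t * t for t in v)
    return mean, math.sqrt(var)

# Hypothetical toy data: admission score -> discharge score.
# Two patients share the same admission score (40) but differ at discharge.
admission = [20, 25, 30, 40, 40, 55, 60, 70]
discharge = [35, 50, 48, 62, 75, 78, 82, 88]
mean, std = gpr_predict(admission, discharge, 40)
print(f"predicted discharge score: {mean:.1f} +/- {1.96 * std:.1f}")
```

Unlike a point estimate, the returned standard deviation gives a range around the prediction, and it widens for inputs that are far from any training case.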
Although previous studies have assumed a linear model for the FIM score, it is essential to consider that the FIM score is nonlinear and that clinical data are subject to noise. For example, even if the FIM score at admission is the same, FIM scores at discharge vary over a certain range. Therefore, we think that assuming a linear model will result in a poor fit when building a prediction model with FIM scores. This study used SLR as a conventional regression method, and RT, EL, ANN, and SVR as previously reported machine learning methods. GPR was also used as a novel prognostic model for discharge FIM scores. The present study aimed to compare the predictive accuracies of SLR and machine learning methods (RT, EL, ANN, SVR, and GPR) for discharge FIM scores in stroke patients.

Study design
This observational, retrospective study was approved by the Tokyo Bay Rehabilitation Hospital's Institutional Review Board (267-2). This study was conducted in accordance with the principles of the Declaration of Helsinki [39].

Participants
A total of 1,552 subacute stroke patients were admitted to Tokyo Bay Rehabilitation Hospital between March 1, 2015, and September 30, 2019. In Japan, after acute treatment, most subacute stroke patients are transferred to rehabilitation hospitals to receive intensive rehabilitation, and Tokyo Bay Rehabilitation Hospital is one such hospital. The inclusion criteria were (1) a first unilateral ischemic or hemorrhagic stroke, (2) an interval of less than 90 days between stroke onset and admission (days since onset), (3) a length of stay between 28 and 180 days, and (4) no history of transfer to an acute hospital. A total of 1,046 eligible patients were enrolled in the present study (Fig 1). Informed consent was obtained in the form of an opt-out on the Tokyo Bay Rehabilitation Hospital's website to exclude people who refused participation. All participants received conventional physical, occupational, and speech therapy for 3 hours daily. Trained nurses recorded participants' FIM scores every 2 weeks, and these data were stored in an electronic medical database.
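The four inclusion criteria can be sketched as a simple eligibility filter. This is only an illustration: the record type `Admission` and its field names are hypothetical, and treating the 28-day and 180-day limits as inclusive is an assumption, since the text does not state whether the bounds are inclusive.

```python
from dataclasses import dataclass

@dataclass
class Admission:
    """Hypothetical record; field names are illustrative, not from the study database."""
    first_unilateral_stroke: bool        # criterion (1)
    days_since_onset: int                # days from stroke onset to admission
    length_of_stay: int                  # days of inpatient rehabilitation
    transferred_to_acute_hospital: bool  # criterion (4) excludes these patients

def is_eligible(p: Admission) -> bool:
    """Apply the four inclusion criteria (bounds on length of stay assumed inclusive)."""
    return (
        p.first_unilateral_stroke
        and p.days_since_onset < 90
        and 28 <= p.length_of_stay <= 180
        and not p.transferred_to_acute_hospital
    )

# Hypothetical example records
patients = [
    Admission(True, 30, 90, False),   # meets all criteria
    Admission(True, 95, 90, False),   # admitted too long after onset
    Admission(True, 30, 200, False),  # stay longer than 180 days
    Admission(False, 30, 90, False),  # not a first unilateral stroke
]
eligible = [p for p in patients if is_eligible(p)]
print(len(eligible), "of", len(patients), "records eligible")
```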
The Japanese version of the FIM (version 3.0) [7,40], which has culturally relevant modifications for some of the items, was used [41,42]. In this study, we focused on comparing the accuracies of the machine learning models. If we had adopted more clinical indicators than previous research and then compared the accuracies of the models, we could not have assessed whether machine learning or the additional clinical indicators contributed more to accuracy, so we adopted only a minimum set of basic clinical indicators.

Model development and statistical analysis
In the present study, raw FIM scores at discharge were among the rehabilitation outcomes, and FIM motor scores, cognitive scores, and total scores at discharge were evaluated. FIM gain, defined as the change in the score between admission and discharge [43], was also examined. FIM motor gain, cognitive gain, and total gain were calculated. A previous study [44] evaluated the coefficient of determination (R²) between actual FIM scores and FIM scores predicted by predictive models. Another previous study evaluated the root mean squared error (RMSE) between actual and predicted FIM scores [20]. Therefore, the R² and RMSE values of FIM motor scores, FIM cognitive scores, FIM total scores, FIM motor gain, FIM cognitive gain, and FIM total gain were compared among predictive models in the present study. Forward-backward stepwise linear regression (SLR) was used as a conventional statistical method to predict functional outcomes [44]. A p-value of < 0.05 was considered statistically significant. In addition, five machine learning algorithms, RT [25], EL [30], SVR [22], ANN [13], and GPR [45], were used. Previous studies reported the prediction of functional outcomes after stroke with ANN [19] and SVR [20]. To the best of our knowledge, this is the first study to use GPR for predicting FIM scores.
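The outcome and accuracy measures described above can be sketched as follows. The formulas are the standard ones; defining R² as one minus the ratio of residual to total sum of squares is an assumption, because the text does not state which definition was used, and the scores in the example are invented.

```python
import math

def fim_gain(admission, discharge):
    """FIM gain: discharge score minus admission score, per patient."""
    return [d - a for a, d in zip(admission, discharge)]

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r_squared(actual, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical discharge FIM motor scores and model predictions
actual = [50, 60, 70, 80]
predicted = [52, 58, 71, 79]
print("RMSE:", round(rmse(actual, predicted), 3))
print("R^2 :", round(r_squared(actual, predicted), 3))
print("gain:", fim_gain([30, 40], [50, 45]))
```

A lower RMSE and a higher R² both indicate predictions that are closer to the actual scores; RMSE is expressed in FIM points, whereas R² is unitless.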
The predictor variables were age, days since onset, and admission FIM scores (motor, cognitive, and total scores). Each prediction model was fitted to discharge FIM motor scores, discharge FIM cognitive scores, discharge FIM total scores, FIM motor gain, FIM cognitive gain, and FIM total gain. Statistical analyses were performed, and predictive models were built with MATLAB software, version 2022a (MathWorks, Natick, MA, USA).
Overlearning (overfitting) is a widely known problem in machine learning, especially in ANN [46]. If machine learning models have no restrictions on learning features, they can "memorize" all samples and improve accuracy on training data sets or similar data sets. However, predictive accuracy on dissimilar data sets decreases when overlearning occurs. Therefore, preventing overlearning is important for improving generalization performance [46]. To evaluate generalization performance, the data were divided into a training data set (80%) and a test data set (20%) [47] before learning.
The training data set was used to develop predictive models with 10-fold cross-validation. In 10-fold cross-validation [48], the training data set was randomly split into 10 groups; 9 groups were used as learning data sets, and the remaining group was used as a validation data set. This process was repeated 10 times (Fig 2). RMSE was used as the performance indicator in the present study. Hyperparameters were automatically assigned by MATLAB software through 10-fold cross-validation. After the predictive models were built, each model was evaluated with the test data set. The predictive accuracies of the models were compared using adjusted R² and RMSE between actual and predicted values.
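The data partitioning described above can be sketched in plain Python. This is an illustration under stated assumptions, not the MATLAB procedure used in the study: the function names, the random seeds, and the rule for assigning patients to folds are hypothetical.

```python
import random

def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (training, test) index lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_test = round(n_samples * test_fraction)
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=10, seed=0):
    """Yield (learning, validation) index lists, one pair per fold."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k groups of nearly equal size
    for i in range(k):
        validation = folds[i]
        learning = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield learning, validation

# 1,046 patients: 80% for training (with cross-validation), 20% held out for testing
train_idx, test_idx = holdout_split(1046)
print("training:", len(train_idx), "test:", len(test_idx))
for fold, (learning, validation) in enumerate(k_fold(train_idx), start=1):
    # A model would be fitted on `learning` and scored (e.g. by RMSE) on `validation`.
    print(f"fold {fold}: learning={len(learning)}, validation={len(validation)}")
```

The test indices are never passed to the cross-validation loop, so performance on the test data set reflects cases the model has not seen during learning or hyperparameter selection.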

Results
In this study, machine learning models (RT, EL, ANN, SVR, and GPR) improved the predictive accuracies of FIM prognosis compared to SLR. The predictive performances of each model in validation and test data sets of FIM scores are presented in Table 2, and those of FIM gain are presented in Table 3. The coefficients of the SLR models are presented in S1 Table.

Prediction of FIM scores
Machine learning methods outperformed SLR in predicting FIM motor and FIM total scores. Machine learning improved the predictive accuracies for FIM motor scores in validation data sets (R² = 0.77-0.79, RMSE = 10.251-11.341) compared to SLR (R² = 0.67, RMSE = 13.057). Predictive accuracies for FIM motor scores were better in test data sets than in validation data sets, except for RT. GPR had the best predictive accuracy for FIM motor scores. In contrast, the predictive accuracies for FIM cognitive scores showed no differences between SLR and machine learning models. As with FIM motor scores, GPR had the best R² and RMSE. For FIM total scores, machine learning yielded better R² and RMSE than SLR. Among the machine learning models, ANN, SVR, and GPR tended to perform better than RT and EL. GPR also had the best predictive accuracy of all models. No overlearning was observed in our study because there were no large differences in R² and RMSE between the validation and test data sets.

Prediction of FIM gain
Machine learning also predicted FIM gain more accurately than SLR. For FIM motor gain, machine learning (R² = 0.41-0.50, RMSE = 10.465-11.320) showed improvements over SLR (R² = 0.24, RMSE = 12.849). Comparing the prognostic accuracy between the validation and test data sets, RT and EL showed a significant decrease, whereas ANN, SVR, and GPR decreased only slightly. The present results showed that ANN, SVR, and GPR outperformed SLR, RT, and EL in predicting FIM motor gain. Machine learning also outperformed SLR in predicting FIM cognitive gain. RT and EL had larger differences in accuracy between validation and test data sets than ANN, SVR, and GPR. Therefore, ANN, SVR, and GPR showed stable prognostic accuracy between validation and test data sets for FIM cognitive gain. For FIM total gain, machine learning showed better R² and RMSE than SLR. RT and EL showed larger differences in predictive accuracy between validation and test data sets than ANN, SVR, and GPR. GPR had the best prognostic accuracy (R² = 0.54, RMSE = 12.106) among the predictive models of FIM total gain.

Discussion
The present study aimed to compare the predictive accuracies of SLR and machine learning methods (RT, EL, ANN, SVR, and GPR) for discharge FIM scores in subacute stroke patients. Machine learning models outperformed SLR models in predicting FIM scores and FIM gain, with the exception of FIM cognitive scores. Notably, the results suggested that machine learning models increased the predictive accuracies of FIM gain compared to SLR models.

Comparison of FIM scores between the present study and previous studies
Machine learning models potentially improve the prognostic accuracies of FIM scores at discharge compared to the linear regression model because machine learning models can adapt to complicated non-linear data. The type of model and level of regularization affect R²; therefore, the use of R² for model comparison across different data sets needs careful attention [49]. A previous review reported that the mean R² for discharge FIM motor scores was 0.65 (range 0.35 to 0.82) on multiple linear regression analysis [44]. Therefore, the present machine learning models (R² = 0.75-0.81), using only patients' backgrounds and FIM scores at admission, had better predictive accuracies than most of the previous research. Moreover, the present study showed that the R² of SLR for the discharge FIM motor score was 0.67 for the validation data set and 0.70 for the test data set. This result implied that the FIM prognosis of the present participants was not easier to predict, and that the present data were not more suitable for SLR, than in previous studies. A previous study reported that SVR had good prognostic accuracy for discharge FIM motor scores (RMSE = 26.79) with 55 participants [20]. The SVR model in the present study showed better performance (RMSE = 10.262) than in the previous study. One possible explanation for this difference is the sample size. Machine learning methods require large sample sizes to achieve the best prediction accuracy [50], and recommended sample sizes are several hundred [51]. The sample sizes of previous studies might not have been sufficient to achieve maximum accuracy; in that case, accuracy could decrease because of overlearning, which is one of the problems that reduce machine learning accuracy. In contrast, the present sample comprised 1,046 participants, with 753 participants in the learning data sets, which should be large enough to achieve maximum accuracy.
Overlearning was negligible in predicting FIM scores because the R² values of the test data sets did not show a large decrease from those of the validation data sets. Therefore, the present models could achieve better accuracy while retaining generalization performance.

Comparison of FIM gain between the present study and previous studies
The present study also suggested that machine learning models predicted FIM gain more accurately than SLR. A review article reported that predictive accuracies for FIM motor gain (R² = 0.22, range 0.08 to 0.40) were lower than those for discharge FIM motor scores [44]. The R² for FIM motor gain of the five machine learning models ranged from 0.41 to 0.55, which was higher than the 0.24 of SLR in the present study. To the best of our knowledge, our machine learning models are better than previously published models in predicting FIM motor gain.

Comparison of predictive accuracies among machine learning models
GPR showed better predictive accuracies for FIM motor and total scores and for FIM total gain than the other five models. The RMSE of GPR for FIM total scores was 13.286, the best accuracy of all models. The RMSE of GPR for FIM total gain (RMSE = 12.106) was also the best of all models. Therefore, GPR is suitable for predicting FIM total scores and FIM total gain. A previous study suggested that GPR had better predictive accuracy after spinal cord injury than SVR and SLR [34]. The present result is compatible with the previous study and is the first report of using GPR to predict FIM. In addition, GPR had the best RMSE and might have the best prediction accuracy of the three machine learning methods (ANN, SVR, and GPR). In contrast, machine learning methods did not improve the predictive accuracies for FIM cognitive scores. One possible reason is that patients with the same FIM cognitive score were more diverse than patients with the same motor score, because FIM cognitive scores could not reflect patients' background characteristics. Because machine learning methods treat identical scores as identical cases when building prognostic models, prediction accuracy is thought to decrease when different cases share the same scores. This may be why most previous research reported only FIM motor scores or FIM total scores and excluded cognitive scores.
One possible reason for the improved accuracy of machine learning (RT, EL, SVR, ANN, and GPR) over SLR is that machine learning can handle non-linear data. Neurorehabilitation data tend to be complex and non-linear, and they are also prone to noise contamination due to human error and missing data [13]. ANN is designed to capture non-predefined, nonlinear relationships that conventional analyses cannot recognize [52,53]. SVR [22] and GPR [33] use kernel functions, which allow nonlinear relationships to be treated as linear models. Nonlinear analysis may thus be one of the reasons for the improved prognostic accuracy. ANN [13] and SVR [54] are also robust to noise. In particular, the GPR model is characterized by its resistance to noise. It has been applied to noisy electrocardiograms, where it improved clinical diagnostic accuracy [35], and to big data in epidemiology [55]. Among machine learning methods, SVR, ANN, and GPR are designed to be robust to noise, which may explain why they outperformed RT and EL.
The present machine learning models, using only age, days since onset, and FIM scores, outperformed the multiple linear regression models with other clinical indicators reported in previous studies. Time and human resources at admission are usually limited; therefore, a simplified method is required to predict prognosis. In the present study, only FIM scores at admission, without other clinical indicators, were deliberately used to save time and to allow easy use in clinical practice. Previous studies reported that adding other clinical indicators, such as the Trunk Impairment Scale [10], Stroke Impairment Assessment Set [11], comorbidity index [12], and nutritional conditions [56], to the FIM scores improved prediction accuracy. The NIH Stroke Scale is also well known as a good predictor in the acute phase [57], but we did not consider it because we think it is not suitable for the subacute stroke patients who were enrolled in our study. It has been reported that the integration of conventional clinical indicators and neuroimaging biomarkers significantly improved predictive accuracy [58], and the addition of neuroimaging to this study would be expected to further improve predictive accuracy. Further studies with specific deep learning tools for neuroimaging biomarkers have the potential to improve prediction accuracies for subacute stroke patients. Predictive accuracy is expected to be improved by incorporating clinical indicators in future studies.

Limitations of this study
The first limitation of the present study is that it was an observational, retrospective study at a single center; therefore, the models may be over-adapted to a single center, and their applicability to multiple centers should be examined in a future study. Second, the present study did not include other clinical indicators such as SIAS and TIS, to save time and to allow easy use in practice, and these indicators could increase prognostic accuracy. Moreover, the present study did not include neuroimaging biomarkers such as acute stroke volume, arterial occlusion grade, and ischemic penumbra size. Larger, multicenter studies that include clinical indicators and imaging biomarkers should be conducted to confirm these preliminary results. Third, the present study did not contain enough cases to examine deep learning, and deep learning was not considered. If the number of features and cases increases, deep learning will be considered, which is expected to have higher prediction accuracy than the machine learning models used in the present study. Fourth, machine learning models other than RT could not show the contribution of each explanatory variable to predictive accuracy because machine learning is a black box, unlike SLR.

Conclusions
The results of the present study suggest that machine learning could improve the predictive accuracy of discharge FIM scores and FIM gain compared to SLR with the same stroke patients' data set. Machine learning models with only admission FIM scores had better predictive accuracy than previous studies with other clinical indicators; therefore, they have the potential to be easily used in daily medical practice and to improve prognostic accuracy further when combined with other clinical indicators. In the comparison of machine learning algorithms, ANN, SVR, and GPR outperformed RT and EL. This study is the first to have used GPR to predict FIM, and GPR had better predictive accuracies for FIM total scores and FIM total gain than other models. In addition, this is the first study with enough participants to build machine learning models for predicting FIM, and overlearning did not occur.
Supporting information
S1 Table. Coefficients of SLR models. SE: standard error; Since Onset: days since onset; FIM: Functional Independence Measure.
The aim of our study was to compare the predictive accuracies of a conventional stepwise linear regression (SLR) model and five machine learning models: Regression Tree, Ensemble Learning, Artificial Neural Network, Support Vector Regression, and Gaussian Process Regression. This study built the prognostic models for Activities of Daily Living (ADL) with the Functional Independence Measure (FIM), one of the methods for evaluating ADL. Discharge FIM motor scores, FIM cognitive scores, and FIM total scores were predicted. FIM gain is calculated by subtracting the scores at admission from those at the time of discharge. FIM motor gain, FIM cognitive gain, and FIM total gain were also predicted. A total of 1,046 subacute stroke patients who underwent inpatient rehabilitation participated in the present study. Patient information including age, sex, days since onset, admission and discharge FIM scores, a history of stroke, and transfer to other hospitals was gathered. Statistical analysis was performed with MATLAB software, version 2022a (The MathWorks, Natick, MA, USA). These predictive models were built with these participants' information and 10-fold cross-validation. S1 Table shows